Introduction

Not everything we want to measure comes with an obvious yardstick. If one wants to measure something like a person’s happiness, what would they have at their disposal?

  • Are they smiling?
  • Did they just get a pay raise?
  • Are they interacting well with others?
  • Are they relatively healthy?

Any of these might be useful as an indicator of their current state of happiness, but of course none of them would tell us whether they truly are happy or not. At best, they can be considered imperfect measures. If we consider those and other indicators collectively, perhaps we can get an underlying measure of something we might call happiness, contentment, or some other arbitrary but descriptive name.

The above describes a latent variable, and gives at least one sense of how we might think about them. Models incorporating latent variables show up all over, however, and are used in many ways. Broadly speaking, factor analysis can be seen as a dimension reduction technique, an approach to modeling measurement error and understanding underlying constructs, a means to understand large amounts of text, a way to make movie recommendations, and more.

Some key ideas:

  • Dimension Reduction/Data compression: Often we simply have the goal of taking a whole lot of variables and reducing them to far fewer, while retaining as much information about the originals as possible.
  • Matrix Factorization: Many techniques discussed below fall under the broad heading of matrix factorization, which starts with a matrix and decomposes it into a product of multiple matrices.
  • Latent Linear Models: Many, if not all, of the techniques we discuss can be expressed as a (multivariate) linear model.
  • Measurement error: Some techniques can tell us how well a group of items measures what it is supposed to measure.

Each section provides some suggested packages to serve as starting points for standard implementations. Just note that there will likely be several other packages providing enhancements in a variety of ways.

PCA

PCA seeks to find orthogonal (i.e. uncorrelated) factors that maximize the variance accounted for by each factor. For some data, \(X\) is the \(N \times D\) matrix of \(N\) observations on \(D\) variables. We decompose \(X\) into two matrices: \(Z\), the \(N \times L\) matrix of component scores, where \(L \leq D\), and \(W\), the \(D \times L\) matrix of weights, typically referred to as the factor loading matrix.

Each successive component, or factor, accounts for less of the variance in the original data than the one before. So while PCA returns the same number of components as there are columns in the original data, the big idea is that we want to use far fewer. However, with all components,

\[X = ZW'\]

Take for example, the MNIST data, which contains 28 by 28 pixel images of the digits 0-9. If we unroll the data to 784 columns, each row represents a single digit. We can see in the following how well we can reconstruct a digit via PCA. Even with only two components we can get a sense of what we’re looking at. With all components the reconstruction is perfect.
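As a rough sketch of how such a reconstruction might be done with base R, where X is assumed to be the unrolled \(N \times 784\) pixel matrix (a hypothetical object here, not loaded above):

pca   = prcomp(X)                                   # centered by default, all components
L     = 2                                           # number of components to keep
X_hat = pca$x[, 1:L] %*% t(pca$rotation[, 1:L])     # reconstruction from L components
X_hat = sweep(X_hat, 2, pca$center, '+')            # add the column means back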

Example

Let’s see an example with more digestible results. The following data contain 25 personality items regarding the Big 5 scales. There are 2800 total subjects, and five categories of questions1.

  • agreeableness: a tendency to be compassionate and cooperative rather than suspicious and antagonistic towards others
  • conscientiousness: a tendency to be organized and dependable, show self-discipline, planning
  • openness to experience: appreciation for art, emotion, adventure, unusual ideas, curiosity, and variety of experience.
  • extraversion: energy, assertiveness, sociability and the tendency to seek stimulation in the company of others, and talkativeness.
  • neuroticism: prone to physiological stress, a tendency to experience unpleasant emotions easily, such as anger, anxiety, depression, and vulnerability

We’ll just look at part of this to keep things simple, namely the agreeableness and neuroticism items.

library(psych)
library(dplyr)   # for the pipe, select, and matches

bfi_trim = bfi %>% 
  select(matches('A[1-5]|N[1-5]'))


So let’s do a PCA. We’ll look at two components, though remember that PCA technically returns as many components as there are variables. You’re just seeing part of the results. Note that PCA almost universally requires you to standardize the data, so that each variable has the same variance. Otherwise you’ll just get components reflecting the relative variances. In what follows the PCA is done on the correlation matrix, which amounts to the same thing.

pc = principal(bfi_trim, 2)
pc
Principal Components Analysis
Call: principal(r = bfi_trim, nfactors = 2)
Standardized loadings (pattern matrix) based upon correlation matrix
     RC1   RC2   h2   u2 com
A1  0.09 -0.50 0.25 0.75 1.1
A2  0.03  0.77 0.60 0.40 1.0
A3 -0.01  0.80 0.63 0.37 1.0
A4 -0.07  0.61 0.38 0.62 1.0
A5 -0.16  0.70 0.52 0.48 1.1
N1  0.81 -0.12 0.67 0.33 1.0
N2  0.80 -0.11 0.65 0.35 1.0
N3  0.81 -0.03 0.66 0.34 1.0
N4  0.68 -0.15 0.49 0.51 1.1
N5  0.66  0.05 0.43 0.57 1.0

                       RC1  RC2
SS loadings           2.90 2.40
Proportion Var        0.29 0.24
Cumulative Var        0.29 0.53
Proportion Explained  0.55 0.45
Cumulative Proportion 0.55 1.00

Mean item complexity =  1
Test of the hypothesis that 2 components are sufficient.

The root mean square of the residuals (RMSR) is  0.09 
 with the empirical chi square  1923.02  with prob <  0 

Fit based upon off diagonal values = 0.91

First focus on the portion of the output labeled SS loadings. That line gives the sum of the squared loadings2 for each component (in this case, where the correlation matrix is the basis, summing across all 10 possible components would equal 10). Proportion Var tells us how much of the overall variance in the variables each component accounts for (e.g. 2.9 / 10 = 0.29). Cumulative Var tells us that the 2 components together account for just over 53% of the variance. Proportion Explained and Cumulative Proportion are the same ideas, just relative to the 2 retained components rather than all 10 variables (so the cumulative explained proportion equals 1). We can see that the second component accounts for less variance than the first, and this pattern would continue with additional components, each accounting for a decreasing amount.
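As a quick check, these quantities can be computed directly from the loadings of the pc object above; a minimal sketch:

W = unclass(pc$loadings)          # loadings as a plain matrix
colSums(W^2)                      # SS loadings
colSums(W^2) / ncol(bfi_trim)     # Proportion Var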

Some explanation of the other parts of the output:

  • h2: the amount of variance in the item/variable explained by the (retained) components. It is the sum of the item’s squared loadings, a.k.a. the communality. For example, the first item (A1) shares only 25% of its variance with these components.
  • u2: 1 - h2
  • com: A measure of item complexity. A value of 1 would be seen for an item that loads on only one component and zero on the others (a.k.a. perfect simple structure).
  • RC: these factors are called RC to denote these are rotated components. They’ve been rotated to aid interpretability of the loadings.

Loadings, also referred to as the pattern matrix, in this scenario represent the estimated correlation of an item with its component, and provide the key way in which we interpret the factors. As an example, we can reproduce the loadings by correlating the observed variables with the estimated component scores3.

cor(bfi_trim, pc$scores, use = 'pair') %>% round(2)
     RC1   RC2
A1  0.09 -0.50
A2  0.03  0.77
A3 -0.02  0.80
A4 -0.08  0.62
A5 -0.17  0.70
N1  0.81 -0.12
N2  0.80 -0.12
N3  0.81 -0.04
N4  0.69 -0.15
N5  0.66  0.05

It can be difficult to sort it out just by looking at the values, so we’ll look at it visually. In the following plot, stronger and more positive loadings are indicated by blue (negative are indicated by red), and we can see the different variables associated with different components.


Many would use what’s called a biplot, which shows both observations and variables on the same plot, with components on the axes: the component scores of the data appear as points, and the loadings as vectors whose length and direction convey the strength of the loading on a particular component.
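As a sketch, one quick way to get a basic (unrotated) biplot with base R, dropping incomplete rows so prcomp will run:

pc_base = prcomp(na.omit(bfi_trim), scale. = TRUE)
biplot(pc_base, cex = .5)   # observations as points/labels, loadings as arrows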

The shearing of the data here is just due to the fact that the variables only take on 6 values, while we are looking at them in terms of continuous component scores. Biplots often take a bit to figure out even for two components, especially when the result isn’t clean, and one can’t really get a sense of more than two components at a time. However, using a graphical model makes clear what’s going on for all retained components. Low loadings have been suppressed for additional clarity.


Interpretation is the fun, but commonly difficult, part. In this case, the variables group like we’d expect4. So we have one component that represents agreeableness and one that represents neuroticism. Note that the smaller negative loading comes from a reverse scored item (‘I am indifferent to the feelings of others’)5.

It’s worth mentioning the naming fallacy at this point. Just because we associate a factor with some concept, doesn’t make it so. In addition, for PCA the goal is not interpretation, but to reduce the data while retaining as much of the variance as possible.

PCA Summary

Why do it?

  • Reduce the dimensions of the data
  • Retain as much variance as possible

Issues

  • It might not be the best choice with some data types
  • It is not the choice to make if you want to understand the covariance structure of the data (though it will often agree with methods that do)
  • Non-numeric or mixed data types

Other stuff

  • Scale the variables
  • Biplots are ugly

Suggested R packages

  • psych
  • pcaMethods (Bioconductor)

Factor Analysis

Now let’s examine what is sometimes called common factor analysis, sometimes exploratory factor analysis in the social sciences, even ‘classical’ or ‘traditional’ factor analysis, but typically just factor analysis (FA) everywhere else. While we can use both PCA and FA for similar purposes (dimension reduction), and even interpret them in similar ways (in terms of loadings), there are some underlying subtleties that distinguish the two.

First let’s revisit PCA, where we can depict it conceptually as an approach where we attempt to reproduce the correlation matrix \(R\) in terms of the product of components, represented by our loading matrix \(W\).

\[R = WW'\] and

\[C = XV\]

In other words, each component score \(C\), i.e. the score for a particular observation with regard to the component, is a weighted combination of the \(D\) observed variables \(X\), the weights (with weight matrix \(V\)) of which are determined by the loadings6.
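To make the relationship concrete, a minimal sketch using base R on standardized data (this is the unrotated solution; see footnote 6 for how psych arrives at an equivalent result):

X = scale(na.omit(bfi_trim))
V = eigen(cor(X))$vectors   # weights: eigenvectors of the correlation matrix
C = X %*% V                 # component scores (up to sign and scaling)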

Things are different with factor analysis. Now the causal flow is in the other direction, originating with the latent variables.

\[X \approx ZW'\]

Each observed variable \(x\) is a function of the latent variables it is associated with. In addition, we also take into account the uniquenesses \(\Psi\), i.e. the part the factors do not explain.

And to reproduce the correlation matrix:

\[R \approx WW'\] \[R = WW' + \Psi\]

So if we just use the loadings from the FA, we cannot reproduce the correlation matrix exactly; we need to add the uniquenesses as well.

What this amounts to conceptually are a few key ideas:

  • Factor analysis focuses on covariance. PCA focuses on variance.
  • Factors are the cause of the observed variables; variables are the cause of components.
  • Factor analysis does not assume perfect measurement of observed variables.

Example

Now let’s do a factor analysis using the same data as we did for the PCA.

fa_model = fa(bfi_trim, 2)
fa_model
Factor Analysis using method =  minres
Call: fa(r = bfi_trim, nfactors = 2)
Standardized loadings (pattern matrix) based upon correlation matrix
     MR1   MR2   h2   u2 com
A1  0.07 -0.36 0.14 0.86 1.1
A2  0.05  0.69 0.47 0.53 1.0
A3  0.03  0.76 0.56 0.44 1.0
A4 -0.05  0.47 0.24 0.76 1.0
A5 -0.12  0.60 0.39 0.61 1.1
N1  0.78 -0.03 0.61 0.39 1.0
N2  0.76 -0.02 0.58 0.42 1.0
N3  0.77  0.05 0.58 0.42 1.0
N4  0.58 -0.08 0.36 0.64 1.0
N5  0.54  0.08 0.29 0.71 1.0

                       MR1  MR2
SS loadings           2.44 1.78
Proportion Var        0.24 0.18
Cumulative Var        0.24 0.42
Proportion Explained  0.58 0.42
Cumulative Proportion 0.58 1.00

 With factor correlations of 
      MR1   MR2
MR1  1.00 -0.19
MR2 -0.19  1.00

Mean item complexity =  1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are  45  and the objective function was  2.82 with Chi Square of  7880.99
The degrees of freedom for the model are 26  and the objective function was  0.23 

The root mean square of the residuals (RMSR) is  0.04 
The df corrected root mean square of the residuals is  0.05 

The harmonic number of observations is  2759 with the empirical chi square  396.78  with prob <  5.7e-68 
The total number of observations was  2800  with Likelihood Chi Square =  636.27  with prob <  1.6e-117 

Tucker Lewis Index of factoring reliability =  0.865
RMSEA index =  0.092  and the 90 % confidence intervals are  0.085 0.098
BIC =  429.9
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy             
                                                   MR1  MR2
Correlation of (regression) scores with factors   0.92 0.88
Multiple R square of scores with factors          0.84 0.77
Minimum correlation of possible factor scores     0.68 0.54

The results in terms of loadings are similar to the PCA, but we can see some differences. The PCA components accounted for 53% of the variance, while here we only account for 42%. In addition, we have an estimate of the factor correlation, -0.19, whereas the PCA components are uncorrelated by definition. There are also measures of ‘fit’ that come from the structural equation modeling context. See this for more detail on those measures. I will note that BIC is provided, which allows for a way to compare different factor solutions.
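To see the earlier point about reproducing the correlation matrix in action, here is a sketch using the fa_model result. Because the default rotation is oblique, the factor correlation matrix \(\Phi\) enters as well, i.e. \(R \approx W\Phi W' + \Psi\).

W     = unclass(fa_model$loadings)      # factor loadings
Phi   = fa_model$Phi                    # factor correlations
Psi   = diag(fa_model$uniquenesses)     # uniquenesses
R_hat = W %*% Phi %*% t(W) + Psi
round(R_hat - cor(bfi_trim, use = 'pair'), 2)   # residuals should be small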

FA for measurement

“Confirmatory” Factor Analysis

Some distinguish between exploratory factor analysis and confirmatory factor analysis, but all the latter is doing is constraining some loadings/paths to be zero. Otherwise they are identical techniques.

Why would you set loadings or factor correlations to zero? Theory would suggest it. For example, the agreeableness items shouldn’t be a measure of neuroticism after accounting for the correlation of the factors, so we can set things up to coincide with our substantive assumptions.

modelCode = "
  agree =~ A1 + A2 + A3 + A4 + A5
  neurot =~ N1 + N2 + N3 + N4 + N5

  # not necessary
  # agree ~~ neurot

  # if you don't want them correlated
  # agree ~~ 0*neurot
"

library(lavaan)

cfa_model = cfa(modelCode, data = bfi_trim)
summary(cfa_model, standardized = TRUE, rsq = TRUE, nd = 2)


Note that if we only had a single factor, the results from ‘confirmatory’ and ‘exploratory’ FA are identical.

Suggested R packages

  • psych
  • lavaan

As latent linear models

In the following we compare PCA and FA from a probabilistic modeling perspective. This further highlights their differences and similarities.

FA

\[ X \sim \mathcal{N}(ZW' + \mu, \Psi) \] \(\mu\) are the intercepts, and \(\Psi\) is a \(D \times D\) diagonal covariance matrix holding the uniquenesses.

Probabilistic PCA

Probabilistic PCA is the special case where the uniquenesses are constrained to be equal:

\[\Psi = \sigma^2I\]

PCA

Standard PCA is recovered as the limiting case with no noise:

\[\sigma^2 \rightarrow 0\]

More techniques

Count-based Matrix Factorization

Commonly our data are counts or otherwise compositional. In this case we can use techniques that are better suited to such data.

NMF

Non-negative matrix factorization is similar to PCA, but with the constraint that scores and weights be non-negative. In its usual implementation it is essentially equivalent to probabilistic latent semantic analysis.
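A minimal sketch using the NMF package (suggested below), where V is assumed to be a non-negative matrix, e.g. a document-term count matrix (a hypothetical object here):

library(NMF)

res = nmf(V, rank = 2)
W = basis(res)   # analogous to loadings
H = coef(res)    # analogous to scores/mixture coefficients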

LDA

Latent Dirichlet Allocation takes a different approach to dealing with count data. The classic example regards text analysis, where LDA is applied to word frequencies across documents. Consider a document-term matrix where each row regards a document and each column a word or term. LDA looks for latent topics, which are probability distributions over the terms. There is a probability distribution over the topics as well as over the terms within a topic, and given both, we will see some occurrence of each term in a document. Each document can be seen as a mixture of topics, though some will be more probable than others.

In the factor analysis sense, each variable (the term) will have some probability of occurring for a given topic, i.e. latent factor. One can think of the term/variable probabilities for a specific topic similar to how we did loadings in the factor analysis sense. Every variable will have some non-zero probability of occurrence for a given factor, but often they are essentially zero. Factor scores are the document topic probabilities.

Given the probabilities of topics and of terms within topics, there is a multinomial draw of counts for an observation. With this probabilistic result, there isn’t a ‘reconstruction’ connotation with LDA. However, we can simulate a multinomial draw based on the total counts we see. For example, let’s assume 5 terms seen a total of 20 times for a given document, given some probability of each term.

t(rmultinom(1, size = 20, prob=c(.5, .2, .15, .1, .05))) # first term is most likely, 5th term is least likely
     [,1] [,2] [,3] [,4] [,5]
[1,]    9    6    2    1    2

Those are the counts we’d see for a given observation. This treats documents as a ‘bag of words’, where we don’t care about ordering. Surprisingly though, simply using the document-term matrix and this bag-of-words approach does very well at uncovering latent topics that make sense.

The following shows a text analysis of political blogs using the lda package. As in other factor analysis we’ve discussed, one must choose the number of latent ‘topics’.

library(lda)
data(poliblog.documents)
data(poliblog.vocab)

set.seed(8675309)

K <- 10 ## Num topics

# this will take at least a few seconds to run
result <- lda.collapsed.gibbs.sampler(poliblog.documents,
                                       K,  ## Num topics
                                       poliblog.vocab,
                                       2000,  ## Num iterations
                                       0.1,   ## alpha: document-topic prior
                                       0.1)   ## eta: topic-term prior

## Get the top words in the cluster
top.words <- top.topic.words(result$topics, 5)

This shows the topics (identified by their top 5 words) and their probability of occurrence in a sample of 10 blogs. For example, blog 1 is more about McCain while blog 6 is more democratic leaning.
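The underlying document-topic proportions come from the sampler’s assignment counts; a sketch of how one might compute them:

# normalize each document's topic assignment counts to proportions
topic_props = t(result$document_sums) / colSums(result$document_sums)
round(topic_props[1:10, ], 2)   # topic proportions for the first 10 blogs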

See my text analysis document for more details and examples. In my experience, this kind of data comes up quite often, and one should have LDA in their Factor Analysis Toolkit (FAT)™, as readily available as PCA and FA. LDA wouldn’t be the best tool if your goal is prediction and you don’t care as much about interpretation. But when you do, it can be rewarding to use. Furthermore, there is supervised LDA, where one can use the topics to make predictions on outcomes of interest, and structural topic modeling, where one can use covariates to model the topic probabilities. In short, you have plenty to play with here.

Here is the number 3 reconstructed with LDA.

Suggested R packages

  • NMF
  • lda
  • topicmodels
  • quanteda
  • text2vec

SEM

Structural Equation Modeling is a very general technique that allows one to incorporate latent variables and observed variables within a single modeling framework, allowing for indirect effects and multiple outcomes. Practically any standard model you’ve conducted can be seen as a special case. We actually already did an SEM when we did the confirmatory factor analysis.

We can use factor analysis in the SEM context as a way to assess measurement error. Whatever variance a factor doesn’t explain is a combination of measurement error and random variation. Thus we can use factor analysis as a means to assess a group of variables’ ability to measure the underlying construct they are supposed to measure.
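As a minimal sketch of assessing measurement quality for the earlier cfa_model fit, using semTools (suggested below); the exact reliability functions available depend on the semTools version:

library(semTools)

reliability(cfa_model)   # e.g. alpha and omega for the agree and neurot factors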

Suggested R packages

  • lavaan
  • semTools

IRT

Item response theory can be seen as latent variable modeling for categorical data. It is especially applied to testing environments where the items are binary correct vs. not outcomes. An underlying logistic model relates the latent variable, which can be seen as ‘ability’ in the testing context, to the observed items which have unique difficulty. The conceptual latent variable model is identical to confirmatory factor analysis.

As an example, the one-parameter, a.k.a. Rasch, model (1PM) can be expressed as follows:

\[P(y_{ij}=1|\theta_i, \delta_j) = \pi_{ij} = \mathrm{logis}(\theta_i-\delta_j)\]

In this setting, the probability of endorsement (or getting an item correct), \(\pi_{ij}\), is a function of the difficulty of item \(j\), \(\delta_j\) above, and the latent trait (ability) of person \(i\), \(\theta_i\). In other words, it’s a specific type of logistic regression model. In the testing context, a person with more ‘ability’ relative to the item difficulty will answer correctly. In terms of the logit:

\[\mathrm{logit_{ij}} = \mathrm{log}(\frac{\pi_{ij}}{1-\pi_{ij}}) = \theta_i-\delta_j\]
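In code, this is just an inverse logit of the difference between ability and difficulty; a minimal sketch:

# probability of a correct response under the 1PL/Rasch model
p_correct = function(theta, delta) plogis(theta - delta)

p_correct(theta = 1, delta = 0)   # ability above the item difficulty: p > .5
p_correct(theta = 0, delta = 1)   # item harder than the person's ability: p < .5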

The following shows an example. With the psych package a factor analysis is done on the tetrachoric correlation matrix of the binary verbal reasoning items. Compare with the standard IRT ‘two parameter’ model from ltm. While the parameterizations are different, they come to the same conclusions about ‘ability’. See my IRT chapter for more details.

data("ability", package = 'psych') 
verbal = ability %>% 
  data.frame() %>% 
  dplyr::select(contains('reason')) %>% 
  drop_na()

library(psych); library(ltm)
irt_fa = fa2irt(fa(verbal, fm='ml'), rho=tetrachoric(verbal), plot = F)
irt_model = ltm(verbal ~ z1)
# irt_fa$fa  # to compare to the standard fa output
irt_fa$irt
$difficulty
$difficulty[[1]]
  reason.4  reason.16  reason.17  reason.19 
-0.5649361 -0.7011017 -0.8416509 -0.4474843 


$discrimination

Loadings:
          ML1  
reason.4  0.746
reason.16 0.559
reason.17 0.894
reason.19 0.590

                 ML1
SS loadings    2.018
Proportion Var 0.505
summary(irt_model)

Call:
ltm(formula = verbal ~ z1)

Model Summary:
   log.Lik      AIC      BIC
 -3058.632 6133.263 6175.102

Coefficients:
                   value std.err   z.vals
Dffclt.reason.4  -0.6156  0.0549 -11.2151
Dffclt.reason.16 -0.9684  0.0798 -12.1367
Dffclt.reason.17 -0.7647  0.0533 -14.3422
Dffclt.reason.19 -0.6010  0.0631  -9.5229
Dscrmn.reason.4   1.8781  0.1897   9.9001
Dscrmn.reason.16  1.3847  0.1356  10.2099
Dscrmn.reason.17  2.5396  0.3141   8.0852
Dscrmn.reason.19  1.4163  0.1357  10.4370

Integration:
method: Gauss-Hermite
quadrature points: 21 

Optimization:
Convergence: 0 
max(|grad|): 0.002 
quasi-Newton: BFGS 
# get scores; ltm only has scores for unique patterns
fa_scores = irt_fa$fa$scores %>% unique() %>% na.omit()
fa_scores = data.frame(fa_scores, verbal[rownames(fa_scores),]) 
ltm_scores = ltm::factor.scores(irt_model)[['score.dat']]
score_data = left_join(fa_scores, ltm_scores)

Suggested R packages

  • ltm
  • psych

ICA

The latent linear model versions of PCA and factor analysis assume the observed variables are normally distributed (and even standard PCA won’t work nearly as well if the data aren’t). ICA, independent component analysis, does not require this. This visualization duplicates that seen in Murphy (2012), where we have two (uniform) independent sources. We can see that ICA correctly recovers those components.

Why do it? You believe that truly independent sources of signal underlie your data, and that those sources are not normally distributed. You deal with images or sound7.
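A minimal sketch along those lines, using the fastICA package (suggested below); the sources and mixing matrix here are arbitrary:

library(fastICA)

set.seed(123)
S = cbind(runif(500), runif(500))   # two independent uniform sources
A = matrix(c(1, 1, -1, 2), 2, 2)    # an arbitrary mixing matrix
X = S %*% A                         # the observed, mixed signals

ica = fastICA(X, n.comp = 2)
head(ica$S)                         # recovered sources (up to order/sign/scale)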

Suggested R packages

  • fastICA

Multidimensional Scaling

Multidimensional scaling8 is a technique for converting a matrix of observation dissimilarities into a low-dimensional map that preserves those distances as well as possible.

The first step in any MDS is choosing a similarity measure. Often this will be a metric or distance, i.e.:

  • Euclidean Distance for continuous variables
  • Manhattan distance or Jaccard dissimilarity for binary variables
  • Similarity measures can often become dissimilarity measures by inverting (x → 1/x) or subtracting from 1 (x → 1−x).

The basic idea is to come up with what we might call components or factors such that the distance between two observations’ factor scores is as close as possible to their actual distance.

data(UScitiesD)
UScitiesMDS <- cmdscale(UScitiesD)
The input distance matrix looks like this:

                Atl  Chi  Den  Hou   LA  Mia   NY   SF  Sea   DC
Atlanta           0  587 1212  701 1936  604  748 2139 2182  543
Chicago         587    0  920  940 1745 1188  713 1858 1737  597
Denver         1212  920    0  879  831 1726 1631  949 1021 1494
Houston         701  940  879    0 1374  968 1420 1645 1891 1220
LosAngeles     1936 1745  831 1374    0 2339 2451  347  959 2300
Miami           604 1188 1726  968 2339    0 1092 2594 2734  923
NewYork         748  713 1631 1420 2451 1092    0 2571 2408  205
SanFrancisco   2139 1858  949 1645  347 2594 2571    0  678 2442
Seattle        2182 1737 1021 1891  959 2734 2408  678    0 2329
Washington.DC   543  597 1494 1220 2300  923  205 2442 2329    0


We can use MDS to obtain a 2-dimensional map preserving distances as well as possible.
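For example, a quick sketch of plotting that map with base R (the axes are arbitrary, so the result may appear flipped or rotated relative to actual geography):

plot(UScitiesMDS, type = 'n', xlab = 'Dimension 1', ylab = 'Dimension 2')
text(UScitiesMDS, labels = rownames(UScitiesMDS))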

Suggested R packages

  • Base R

Recommender Systems

Practically everyone has been exposed to recommender systems such as collaborative filtering and related models. That’s how Netflix, Amazon and others make their recommendations to you given the information you’ve provided about likes and dislikes, what other similar people have provided, and how similar the object of interest is to others.

The following image, taken from Wikipedia, conceptually shows how a user-based collaborative filtering method would work, where a recommendation is given based on what other similar users have given.

Let’s go with movies as an example. You might only rate a handful, and indeed most people will not rate most movies. But at some point most movies will have been rated by someone. So how can one provide a recommendation for some movie you haven’t seen? If we group similar movies into genres, and similar people into demographic categories, one can recommend something from a genre you like that people in your demographic category seem to like.

If you think of genres as latent categories of movies, we can now employ the techniques we’ve talked about9. Similarly, we can find clusters of people using cluster analytic techniques. In short, collaborative filtering/recommender systems apply latent variable techniques to a specific type of data, e.g. ratings. More modern approaches will incorporate user and item characteristics, recommendations from other systems, and additional information.
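As a minimal sketch, using the recommenderlab package (suggested below) and the MovieLense ratings data it ships with:

library(recommenderlab)

data(MovieLense)                                          # user-by-movie ratings matrix
rec  = Recommender(MovieLense[1:500], method = 'UBCF')    # user-based collaborative filtering
pred = predict(rec, MovieLense[501:502], n = 5)           # top-5 recommendations for two held-out users
as(pred, 'list')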

Suggested R packages

  • recommenderlab
  • See also, svd or prcomp in base R

Time-based

Hidden Markov models can be used to model latent discrete states that result in a sequence of observations over time. In terms of a graphical model, we can depict it as follows:

In this sense we have a latent variable \(z\) that represents the hidden state of a system, while the outcome is what we actually observe. There are three latent states, and the relative width of an edge reflects the transition probability of moving from one state to another. The \(x\) are the categories of observations we can potentially observe, and, given a latent state, there is some probability of seeing a particular category. Such a situation might lead to the following observed sequence of observations:

We start in state 1, where the heart and yellow smiley are most probable. Let’s say we observe ♥. The second state is the most likely to follow, so we move there, and because it has some probability of staying in that state, we observe § and §. We finally get to state 3 and see its most probable symbol, then go back to state 1, where we see ☺, jump to latent state 3, and so on. Such a process continues for the length of the sequence we see.

This can be seen as a mixture model/latent class situation that we’ll talk about more later. The outcomes could also be continuous, such that the latent state determines the likelihood of the observation in a manner more directly akin to the latent linear model for standard factor analysis.
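A minimal sketch of such a model using the HMM package (suggested below); the three states, three symbols, and all probabilities here are arbitrary, purely for illustration:

library(HMM)

hmm = initHMM(States  = c('state1', 'state2', 'state3'),
              Symbols = c('heart', 'smiley', 'other'),
              transProbs    = matrix(c(.1, .8, .1,
                                       .2, .6, .2,
                                       .7, .1, .2), nrow = 3, byrow = TRUE),
              emissionProbs = matrix(c(.6, .3, .1,
                                       .1, .2, .7,
                                       .2, .5, .3), nrow = 3, byrow = TRUE))

obs = c('heart', 'other', 'other', 'smiley')
viterbi(hmm, obs)   # most likely sequence of hidden states given the observations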

Suggested R packages

  • msm
  • HMM
  • HiddenMarkov

Bayesian

Dirichlet process, Chinese restaurant process (CRP), Indian buffet process (IBP)

Latent classes

Take a look at the following data. It regards the waiting time between eruptions and the duration of the eruption (both in minutes) for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.


The visualization suggests two clusters: shorter waits go with shorter eruptions, and longer waits with longer eruptions. Moreover, it seems the observations more or less come from only those two possibilities.

In some contexts, possibly far too many, the latent variables are assumed to be discrete in nature rather than continuous. In this case, we can think of the dimension reduction as occurring along the rows rather than the columns. We’ll talk about two types.

Mixture models

A basic approach for categorical latent variable analysis from a model-based perspective could be seen as follows:

  1. Posit the number of clusters you believe there to be
  2. For each observation, estimate the probability that it comes from each cluster, assuming some underlying distribution
  3. Assign observations to the most likely class (i.e. the one with the highest probability)

Typical methods assume an underlying normal distribution, such that each observation’s overall likelihood is a weighted sum of its likelihood under each proposed cluster’s estimated parameters. For example, with two clusters, each with an estimated mean and variance, we get the likelihood of a particular observation under each cluster, and weight these by the estimated probabilities of membership in the clusters.

As an example, we’ll use mclust on the old faithful data, positing 2 clusters.

library(mclust)
mod = Mclust(faithful, G = 2)
summary(mod)
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm 
----------------------------------------------------

Mclust VVE (ellipsoidal, equal orientation) model with 2 components:

 log.likelihood   n df       BIC       ICL
      -1132.187 272 10 -2320.433 -2320.763

Clustering table:
  1   2 
175  97 

We can see that things break out more or less as we might expect, though there is a bit more uncertainty with observations in the middle.

The nice thing about model-based approaches is that if we want to compare different solutions we can use the same tools we do in regression settings. Here, mclust does a search over many models (not just different numbers of clusters, but also things like equal variances, shape, etc.) and selects the ‘best’ model according to BIC10. Interestingly, the result here suggests three clusters for the Old Faithful data.

mod = Mclust(faithful, verbose=F)
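A sketch of how one might inspect that search and the selected model:

summary(mod)              # the selected model
plot(mod, what = 'BIC')   # BIC across numbers of clusters and covariance structures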

For regression

Mixture models are often used in the regression setting, similar to the latent class setting we’ve been describing. One approach treats the outcome of interest as having latent categories and models it as such. In a standard regression setting you would then get two sets of output, one for each class, along with estimated probabilities of class membership for each observation.

library(flexmix)
mod = flexmix(eruptions~waiting,  data=faithful, k = 2)
summary(mod)

Call:
flexmix(formula = eruptions ~ waiting, data = faithful, k = 2)

       prior size post>0 ratio
Comp.1 0.498  148    272 0.544
Comp.2 0.502  124    272 0.456

'log Lik.' -191.5743 (df=7)
AIC: 397.1485   BIC: 422.3891 

We can see from the summary that the two components are of roughly equal size (148 vs. 124 observations). We also get the estimated coefficients and residual standard deviation for each group, as well as the respective probabilities of class membership for each observation. Note that the group labels are completely arbitrary.

parameters(mod)    # intercept, waiting coefficient, and residual std dev (sigma) for each group
                      Comp.1      Comp.2
coef.(Intercept) -1.73204354 -2.06336751
coef.waiting      0.07822496  0.07374856
sigma             0.34866389  0.39692190
First few posterior probabilities of class membership:

     X1     X2
   0.06   0.94
  0.141  0.859
  0.117  0.883
   0.07   0.93
  0.464  0.536
  0.903  0.097
  0.382  0.618
  0.003  0.997
  0.483  0.517
  0.243  0.757


The key idea is as before: the response comes from two distributions, e.g. two normal distributions, each with its own mean, modeled by covariates, and its own variance. This is very commonly applied in cases where there are many zeros in the data, leading to zero-inflated, hurdle, and similar models, where the zeros are modeled by a logistic regression and the counts by a Poisson or negative binomial, but the approach is much more general.
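For instance, a minimal sketch of a zero-inflated count model as a two-part mixture, using the pscl package; the data frame and variables here are hypothetical:

library(pscl)

# y: a count outcome with excess zeros; x: a predictor (both hypothetical)
zi_mod = zeroinfl(y ~ x | x, data = some_data, dist = 'poisson')
summary(zi_mod)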

“Cluster analysis”

Aside from mixture models, when people use the term ‘cluster analysis’ they are typically referring to distance-based methods. Given a distance matrix describing how dissimilar observations are from one another, these methods try to create clusters of observations that are similar to one another.

K-means

K-means cluster analysis is hands-down the most commonly used clustering method. Conceptually it’s fairly straightforward: find \(k\) clusters that minimize the variance of their members around the cluster mean. As such it’s easy to implement in standard data settings.

K-means can actually be seen as a special case of the Gaussian mixture model above (similar to EII in the BIC plot, which looks for spherical clusters of equal size and shape). It also has connections to PCA and ICA. The general issue, again, is trying to decide on the number of clusters, as the two- and three-cluster solutions below illustrate.

library(ggplot2)   # theme_trueMinimal() comes from one of the author's personal packages

clus2 = factor(kmeans(faithful, 2)$cluster)
c2 = ggplot(aes(x=waiting, y=eruptions), data=faithful) +
  geom_point(aes(color=clus2)) +
  theme_trueMinimal()

clus3 = factor(kmeans(faithful, 3)$cluster)
c3 = ggplot(aes(x=waiting, y=eruptions), data=faithful) +
  geom_point(aes(color=clus3)) +
  theme_trueMinimal()

cowplot::plot_grid(c2, c3)

Hierarchical

Other methods cluster the data hierarchically. Agglomerative approaches start at the bottom of the hierarchy, with every observation in its own cluster, and successively combine them. For example, first choose a measure of dissimilarity, combine the two observations that are most alike, then either add another observation to that cluster or, if another pair is closer, start a new cluster. Conversely, divisive approaches start with every observation in one cluster, split off the most dissimilar observations, and continue until every observation is in its own cluster.
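A minimal sketch of the agglomerative approach with base R, again using the Old Faithful data:

d  = dist(faithful)                 # Euclidean distances between observations
hc = hclust(d, method = 'ward.D2')  # agglomerative clustering with Ward's method
cl = cutree(hc, k = 2)              # cut the tree into two clusters
plot(hc, labels = FALSE)            # dendrogram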

Practically every decision one has to make (distance, linkage method, cluster determination, general approach) is arbitrary. While these are actually still not uncommonly used, you always have better alternatives. They are fine to use in a quick visualization to sort the data more meaningfully though, as above.

Predictive Modeling

Issues with Mixture Models

Model Selection

Non-model-based methods provide no non-arbitrary way of selecting the number of clusters within a particular method, or of comparing different approaches on the same data. Many criteria have been offered, but even in typical settings they will rarely agree. For example, let’s say you want to choose among k-means, k-medoids, and two hierarchical techniques. You’ll need to look at 2 to 4 clusters, across 4 techniques, based on 3 measures of ‘best’ (e.g. silhouette score), 2 distance measures, 3 linkage methods (where appropriate), etc. So just starting out you get to run over a hundred models and probably still won’t know which to choose. At least clusterSim will automate it for you.

Latent Expressivity

No one would recommend that you take continuous outcomes or predictors and categorize them. It is sometimes done as a computational convenience, but if you’re interested in interpretability and inference, it’s almost always the wrong choice.

Now ask: why would you do it with latent variables? Unless there is very strong theory or an actual physical aspect involved (e.g. seizure occurrence), you should think hard before assuming a hard classification of observations.

Sometimes one cluster is the best answer

Suggested R packages

  • mclust
  • flexmix
  • kmeanspp
  • Base R

Issues

  • how many latent factors/classes/clusters to choose
  • Interpretation
  • two step models vs. simultaneous estimation for prediction
  • how to treat the data (e.g. numeric vs. categorical)

  1. These descriptions come from the Wikipedia page.

  2. They are the eigenvalues of the correlation matrix. In addition, they are the diagonal of the crossproduct of the loading matrix, i.e. \(W'W\).

  3. These results have been rotated, something rarely practiced with PCA and almost always done with FA.

  4. Muddier results are more the rule than the exception in my experience. Not enough initial development is typically done with scales, and when other people use them in other settings, it often ends with disappointment. Just think, many psychological scales are developed on specific clinical samples or freshman psychology students. How well do you think that will generalize?

  5. In almost every case I’ve come across, reverse-scored items are simply poorer items.

  6. My description here is a bit loose. For the technically inclined, \(V\) are the eigenvectors of \(X'X\), or, if the data are standardized, this is the same as the eigenvectors of the covariance/correlation matrix of \(X\). Technically, the psych package uses weights based on the loadings and covariance matrix via solve(cov(bfi_trim), pc$loadings), which amounts to an equivalent result (even if different in sign/value). For factor analysis this becomes more nuanced. There is no unique way to create a factor score, but the common ones use weights that are a function of the loadings and correlation matrix. For example, the most commonly used (‘Thurstone’ or ‘regression’) scores are constructed via \(V = R^{-1}W\), where \(R\) is the correlation matrix and \(W\) are the loadings as before.

  7. These days one would probably be more likely to use deep learning techniques for audio and images, at least if prediction was the primary concern.

  8. This section has been gracefully stolen from James Henderson. I personally have never once needed to use this technique, nor was I much interested beyond what I knew about it already, so I decided to lift it from him. I can say that LDAvis shows topic relations from LDA using it, so there’s that. James’ workshop on dimension reduction provides more technical detail and additional examples for PCA, FA and MDS, for those inclined.

  9. Indeed, straight-up singular value decomposition, such as that used for PCA, was the hot machine learning tool used by Netflix back in the day.

  10. For whatever reason, mclust displays BIC in higher is better format, in contrast to practically every other modeling context I’ve come across.